Method for determining and recording system information and function in distributed parallel component based software systems

ABSTRACT

A general capability for testing or monitoring distributed, parallel, component-based software systems is provided which only minimally affects the system under observation, automatically equips the correct system components with inspection instruments, provides a semantic assignment of communication messages between the respective transmitter and receiver, and, without synchronized clocks on the individual components of the distributed system, provides a logical event sequence in the form of an evaluation model which can serve as the basis for any kind of evaluation tool.

[0001] The invention relates to a method for tracing, that is for tracking, sequences of activities via event messages from a system which is to be observed, where the system which is to be observed may have several processes, tasks or threads and can execute on different devices.

[0002] Distributed parallel-running component-based software systems of this type, such as for example Microsoft's COM, CORBA or the Enterprise Java Beans systems, are presented briefly, for example, in the book by Jason Pritchard with the title “COM and CORBA Side by Side: Architectures, Strategies and Implementations”, Addison-Wesley, 1999, pages 17 to 25.

[0003] In all phases of their development, that is during the implementation, integration and testing of software systems, as well as in use, that is during their commissioning and operational monitoring, there is a need to be able to inspect and evaluate system run time data for the purposes of analysis, fault localization or to demonstrate the correctness of the software. This data covers system states, such as for example internal variables, and data about communication activities and events including their time sequence.

[0004] Until now, the following have been the familiar methods of tracing:

[0005] a) A debugger, under the control of which every system process is executed and which permits the interactive setting of breakpoints and the inspection of debugging data, that is the contents of local and global variables, when a breakpoint is reached. However, specifically in distributed parallel-running systems, this approach only permits local inspection, from which statements about the overall system can be derived only with difficulty. In addition, the system may be disrupted or may even crash if individual components or processes are halted interactively.

[0006] b) Additional debugging information in the program code, i.e. the names of objects, variables etc. plus details of the mapping of source code lines to machine code, for further analysis. This debugging information is generally used for error analysis after system crashes, that is for a post mortem analysis. This approach again offers a view of the system which is essentially only a local one. With post mortem analyses it is often not possible, using only a knowledge of the final state of the system, to draw any conclusion about the actual cause of the error. Furthermore, particularly in the case of small embedded systems with limited resources, the software should be supplied with no debug data, to keep it as compact as possible.

[0007] c) Pre-instrumented code, that is supplementary program code which is a permanent part of the finished system, and which is activated when required to generate relevant system data. This approach too is often hardly usable on small systems due to resource scarcity, or the instrumentation may be restricted to only some parts of the system. The instrumentation is static and cannot subsequently be changed, that is to say it is only possible to show system data from places in the code where provision was made for doing so at the time of implementation.

[0008] Distributed applications in particular are distinguished by their size and complexity, as a consequence of which they can only be tested with difficulty if the normal means are used.

[0009] Such methods or systems are described, for example, in WO 2000/55733 A1, DE 43 23 787 A1, U.S. Pat. No. 5,790,858, EP 0 470 322 A1, U.S. Pat. No. 5,307,498 and U.S. Pat. No. 5,371,746.

[0010] The object underlying the invention now consists in specifying a method or device by which the disadvantages set out above are avoided.

[0011] In accordance with the invention, this object is achieved in terms of the method by the characteristics of patent claim 1 and in terms of the device by the characteristics of patent claim 4.

[0012] The further claims concern preferred developments of the method or device, as applicable.

[0013] The invention creates a general capability to test or observe distributed parallel-running component-based software systems, which makes only minimal changes to the system to be observed, automatically provides inspection instruments to the correct system components, makes available a semantic assignment of communication messages between the transmitter and receiver concerned, and, without having synchronized time on the individual components of the distributed system, provides a logical event sequence in the form of an evaluation model, which can be used as the basis for any required type of evaluation tools.

[0014] The invention is explained in more detail below by reference to an exemplary form of embodiment shown in the drawings. These show

[0015] FIG. 1 a summary view to explain the method or device in accordance with the invention, as applicable,

[0016] FIG. 2 a diagram to explain the automatic instrumentation,

[0017] FIG. 3 a diagram to explain how the so-called call instrumentation works,

[0018] FIG. 4 a diagram to explain the abstract design of a model used in the method according to the invention,

[0019] FIG. 5 a diagram to explain the application structure of this model, and

[0020] FIG. 6 a diagram to explain the physical structure of this model.

[0021] FIG. 1 shows a framework program FACT, for automated component tracing, which receives its input information IN in the form of messages from a system SUB which is under observation, and which itself provides output information OUT, either online or offline, to visualization or analysis tools TOOLS. Depending on data from a configuration module CONF, an implantation or injection of inspection instruments I1 . . . I3 is effected with the help of an instrumentation module INST into system components SC1 and SC2, selected from among the system components SC1 . . . SC3 on the basis of the configuration, or into a so-called item of middleware MID, which is described in more detail below. Optionally, an inspection component IC can also be implanted or injected into the system under observation, SUB. The inspection instruments I1 and I2 in the system components SC1 and SC2 then supply their messages either directly or by making use of the inspection component IC, which then in turn supplies the messages IN for the framework program FACT. The framework program FACT has an observation part OBS, which collects the IN messages and from which, filtered in accordance with items of type information TI, which are also available to the configuration module, they are then passed to a transformation module TRANS, where they are then, depending on the configuration module, passed as OUT data for a general evaluation model to the tools TOOLS, for visualization or analysis. This division of the system into three parts should be seen as a purely logical structure, with no implications about the physical structuring across several participatory processes or devices.

[0022] The global sequence of activities of the framework program FACT breaks down essentially into two procedures:

[0023] a) The instrumentation: Taking into account the configuration data from the CONF module and the type information TI, the automatic instrumentation component or module, INST, implants the instruments I1, I2 or I3 in the system components SC1 and SC2 of the system SUB under observation. This procedure takes place as soon as a component is created, that is both when the system is started and also when any subsequent dynamic expansion of the system SUB takes place at run time.

[0024] b) The recording and forwarding of system information: Messages are generated in the inspection component IC and are forwarded to the observer OBS, which collects these messages and filters them depending on the configuration module or the type information, as applicable, before forwarding them to the transformation module. In the TRANS module, the type information TI is used to effect a conversion and forwarding to the visualization or analysis facilities.
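The recording and forwarding procedure described above can be pictured with a minimal Python sketch. It is an illustration only, not the implementation of the framework program FACT: the names RawMessage, TYPE_INFO, CONFIGURED_TYPES, observe and transform are hypothetical, and the filtering rule (keep messages whose type is configured) is an assumed simplification of the role of OBS, CONF and TI.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RawMessage:
    """Hypothetical raw message as an inspection instrument might emit it."""
    type_id: int   # numeric type ID of the emitting system element
    event: str     # e.g. "send", "receive", "pass"
    payload: str


# Assumed stand-in for the type information TI: type ID -> readable name.
TYPE_INFO = {1: "OrderService", 2: "Inventory", 3: "Logger"}

# Assumed stand-in for the configuration CONF: types selected for tracing.
CONFIGURED_TYPES = {1, 2}


def observe(messages, configured_types):
    """OBS step: collect messages and keep only those of configured types."""
    return [m for m in messages if m.type_id in configured_types]


def transform(messages, type_info):
    """TRANS step: attach readable type names to the filtered messages."""
    return [
        {
            "element": type_info.get(m.type_id, f"type#{m.type_id}"),
            "event": m.event,
            "payload": m.payload,
        }
        for m in messages
    ]


incoming = [
    RawMessage(1, "send", "reserve(item=7)"),
    RawMessage(3, "pass", "log line"),
    RawMessage(2, "receive", "reserve(item=7)"),
]
out = transform(observe(incoming, CONFIGURED_TYPES), TYPE_INFO)
```

In this sketch the message from the unconfigured type 3 is dropped by the observation step, and the remaining two records carry plain-text element names for a downstream tool.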

[0025] The automatic instrumentation of application components in order to extract system data, and the documentation of dynamic system activity sequences, are implemented by combining and applying methods of system observation and enhancement which are in themselves common. By comparison with other methods, however, this combination offers a range of advantages:

[0026] 1. The instrumentation is effected at the system run time, that is to say the system design and system structure are not affected by it.

[0027] 2. For any particular instrumentation, no new compilation of the program code is required. The method can therefore also be applied to software supplied in release version form or to binary components supplied by other manufacturers.

[0028] 3. All the relevant parts of a program can be instrumented, the essential point here being that components which are created at execution time can be automatically and retrospectively instrumented. This makes it possible to obtain a view of the entire system and not simply of the static parts, that is the parts which are known at the time of start-up. The scope of the instrumentation and the data which is to be captured can also be configured at the time of execution.

[0029] 4. The instrumentation code is very compact and hence it can also be used for embedded systems with limited resources.

[0030] 5. The execution of the instrumentation code is very efficient.

[0031] 6. The behavior of the application is only slightly affected.

[0032] The INST module for automatic instrumentation is used at two points in time. The first is when the system to be investigated, SUB, is started: by instrumenting an item of so-called “middleware” with the “Create” instrumentation I3, provision is made for ensuring that the framework program FACT is informed when a new component is created in the system SUB which is to be observed. Here, the middleware, MID, is to be regarded as a piece of software which mediates between an application and the network layer beneath it. The middleware exports its functionality via a defined program interface (API) and, for example, implements the interaction between parts of the application which are executed on devices on different platforms. The second point in time is the creation of a system component at some later point while the system SUB is being executed. Here the “Create” part of the instrumentation, I3, implants a “Call” instrumentation component I1 in the newly created component SC1, to monitor all the calls of this component. These relationships are shown in FIGS. 2 and 3.
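The two-stage instrumentation can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the Windows/COM mechanism of the described implementation: the Middleware class, its create method, CallInstrument, install_create_instrument and the trace list are all hypothetical names, and call interception is modeled by a delegating proxy object.

```python
trace = []  # stands in for the messages supplied to the framework program


class CallInstrument:
    """Delegating proxy: forwards every method call and records it."""

    def __init__(self, target, name):
        self._target = target
        self._name = name

    def __getattr__(self, attr):
        # Only reached for attributes not found on the proxy itself.
        member = getattr(self._target, attr)
        if not callable(member):
            return member

        def wrapper(*args, **kwargs):
            trace.append(("call", self._name, attr))
            result = member(*args, **kwargs)
            trace.append(("return", self._name, attr))
            return result

        return wrapper


class Middleware:
    """Hypothetical middleware offering a single component factory."""

    def create(self, cls, *args):
        return cls(*args)


def install_create_instrument(middleware):
    """'Create' instrument: intercept the factory of this middleware object."""
    original_create = middleware.create

    def instrumented_create(cls, *args):
        component = original_create(cls, *args)
        trace.append(("create", cls.__name__))
        # Implant the 'Call' instrument in the newly created component.
        return CallInstrument(component, cls.__name__)

    middleware.create = instrumented_create


class Counter:
    """Example component of the system under observation."""

    def __init__(self):
        self.value = 0

    def increment(self):
        self.value += 1
        return self.value


mid = Middleware()
install_create_instrument(mid)   # at start-up of the observed system
counter = mid.create(Counter)    # later, at run time
counter.increment()
```

The component class itself is unchanged and its interface is preserved; only the factory of the middleware object is redirected, which mirrors the idea that components created at run time are instrumented automatically.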

[0033] The instrumentation components I1, I2 and I3 are directly linked to the appropriate parts of the system, SC1, SC2 and MID, that is the system components or middleware, and form one unit. This does not alter any of the component interfaces, so that there are no effects on the overall system. The inspection component is implemented as a generic object, that is to say it receives data from various instrumented code locations. An advantage of this is that there is only one copy of the instrumentation code for each device, but it can be addressed from within different processes, which saves on memory space. The configuration module CONF enables a selection to be made of system components which are to record messages when the system observation information is being collected. This configuration module uses the type information TI, with which all the system components SC1 . . . SC3 can be uniquely identified, and which contains among other details readable names for the system elements. The instrumentation component compares the types of the system components with the planned data generation types, and decides on the instrumentation for each component. This ensures that the data generated is exclusively that which the user planned for, and in addition prevents any possible overflow of buffer memories.

[0034] Within the framework program, the observation module OBS is responsible for collecting the raw data from the various inspection instruments, and for the selection or filtering, as applicable, of data to be passed on to the transformation module. The configuration prescribes which items of data are to be selected or filtered. The configuration can be defined in advance or at the time of execution of the system. The observation module OBS can also be configured, independently of the configuration of the instrumentation, with respect to the components or objects to be traced, the contents of the trace and the depth of the trace, that is the level of detail of the system data which is collected.

[0035] The transformation module TRANS undertakes the task of processing and supplementing the raw data collected in the various parts of the system to be observed, SUB. The OUT data items are for a model, and are transferred to visualization/analysis tools via a defined interface. The model is chosen to be flexible, allowing a host of possible tools to be used. In the transformation module, the logical and communication interrelationships between the individual system elements are reproduced, the events are put into their correct time sequence, and static information about the types of the system components is added in order to give the data a structure and to make it readable by a person, for example by using names.

[0036] The following items of input data IN are required from the system under observation, SUB:

[0037] 1. An identification of the components within the program unit together with the run time environment, that is to say the current process and the current thread. In general, these items of information can be queried using operating system functions, whereby a unique identification ID is provided for each device. Such an ID represents a unique identification for an object, and in most cases consists of a non-negative integer value.

[0038] 2. Type information about the system elements which are to be observed, that is, about the components, objects, interfaces and global functions. The type information specifies the abstract structure of the system elements. It is specified at the point in time when the application is designed, and is generally available in a well-defined format. The framework program FACT itself does not absolutely require this information, but the tools TOOLS which are based on this framework program can use it to document the design structure and to implement a presentation which can be read by a person.

[0039] In detail, the following parts of the type information are relevant:

[0040] Names of the system elements or sometimes even just the unique identification numbers, the IDs.

[0041] For objects and interfaces: a list of methods and their signature, that is to say details of the parameters and return values for the method.

[0042] For components: a list of the interfaces for these components.

[0043] For interfaces: a list of the methods which the interface provides for, and their signatures.

[0044] In the case where system elements make use, by inheritance or aggregation, of implementations of other elements, which is provided for in the case of object-oriented programming languages, the structure of the inheritance/aggregation relationships is also required.
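The parts of the type information listed above could be held in simple record types. The following Python sketch shows one possible layout; the class names MethodType, InterfaceType and ComponentType, the example entries and the helper readable_name are hypothetical and do not correspond to any format prescribed by the method.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MethodType:
    name: str
    parameters: tuple   # parameter type names, in order
    returns: str        # return type name


@dataclass(frozen=True)
class InterfaceType:
    type_id: int
    name: str
    methods: tuple      # MethodType entries with their signatures


@dataclass(frozen=True)
class ComponentType:
    type_id: int
    name: str
    interfaces: tuple   # type IDs of the interfaces of this component
    bases: tuple = ()   # type IDs of inherited or aggregated elements


def readable_name(type_table, type_id):
    """Return the plain-text name for a type ID, or a marked fallback."""
    entry = type_table.get(type_id)
    return entry.name if entry is not None else f"<unknown type {type_id}>"


# Illustrative entries only.
i_stock = InterfaceType(
    10, "IStock", (MethodType("reserve", ("int",), "bool"),)
)
inventory = ComponentType(20, "Inventory", interfaces=(10,))
type_table = {t.type_id: t for t in (i_stock, inventory)}
```

A tool built on the framework could use such a table to replace internal IDs in the trace by names and to group events by interface or component.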

[0045] In addition, use is made of the following services of the system to be observed:

[0046] A listing of the modules which belong to an application or the identification of the components involved.

[0047] The injection of code blocks or libraries into external processes in order to effect the instrumentation.

[0048] Interception of system functions, in particular of functions which create objects or components, so that components which are generated during the execution time can also be monitored, and the interception of registration messages and method calls.

[0049] Some of these services are available from the operating system or are implemented in the FACT module by the use of familiar methods.

[0050] From the raw IN data, the transformation derives the following items of data for a model:

[0051] information about system elements, categorized by various aspects: relating to the abstract design, the application structure and the physical run time structure,

[0052] information about the system structure, that is the relationships of the system elements to each other, such as for example the association of processes with devices, where this association is not static but may change over the duration of the execution time,

[0053] items of data which specify the life-cycle of the system elements,

[0054] information about local and distributed communication activities and events,

[0055] items of data which specify the interrelationships between communicating system elements together with the logical sequence of events, without this requiring synchronized clocks, and

[0056] items of data which permit the unambiguous identification of system elements, legible to a person, i.e. plain text names and not merely internal IDs.

[0057] For this purpose, the FACT module generates system-wide unique IDs for the individual system elements and for the individual communication activity sequences, which are appended to the system messages concerned.

[0058] For a good many system elements, IDs are already issued by the operating system. However, these are only locally unique, typically for a program unit or for a device. The transformation component combines different local IDs into a system-wide unique ID for each system element, or itself issues unique IDs. This makes system-wide identification possible.
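One way to picture the combination of local IDs into a system-wide unique ID is the following Python sketch. It is an assumption-laden illustration: the class GlobalIdRegistry, the choice of the key (device, process ID, local ID) and the use of a running counter are hypothetical and are not taken from the described implementation.

```python
import itertools


class GlobalIdRegistry:
    """Maps locally unique IDs to system-wide unique IDs."""

    def __init__(self):
        self._ids = {}
        self._counter = itertools.count(1)

    def global_id(self, device, process_id, local_id):
        # The local ID is only unique within its device and process,
        # so the combination of all three serves as the lookup key.
        key = (device, process_id, local_id)
        if key not in self._ids:
            self._ids[key] = next(self._counter)
        return self._ids[key]


registry = GlobalIdRegistry()
a = registry.global_id("device-A", 4711, 1)
b = registry.global_id("device-B", 4711, 1)   # same local IDs, other device
a_again = registry.global_id("device-A", 4711, 1)
```

Two elements that happen to share the same process ID and local ID on different devices receive different system-wide IDs, while repeated lookups for the same element are stable.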

[0059] When new system elements are created, either at the time of start-up or at the time of execution, the type information and information relating to the run time environment are used to reproduce in addition the relationships between the different system elements. From these items of data it is possible to infer the run time structure of the distributed system.

[0060] The items of data generated by the instrumentation to document an individual sequence of communication activities have a logical interdependence which is not, however, directly contained in the data which has been collected. The method adds the appropriate dependencies to this data.

[0061] For this purpose, a unique ID is generated for each individual communication. Every event which is part of this sequence of communication activities contains this ID. This makes the following possible:

[0062] a semantic assignment of the communication messages from the transmitter(s) to those of the receiver, and

[0063] the determination of the logical sequence of events without synchronization of the time on the individual nodes of the distributed system.
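How a communication ID allows events to be ordered without synchronized clocks can be sketched as follows. The sketch makes several assumptions that are not stated in the text: each message transfer carries its own ID, every source numbers its own events consecutively (local_seq), and the logical sequence is obtained by a topological sort over "same source, earlier event" and "send before matching receive" relations. The names Event and logical_order are hypothetical.

```python
from dataclasses import dataclass
from graphlib import TopologicalSorter  # standard library, Python 3.9+


@dataclass(frozen=True)
class Event:
    source: str     # emitting process or device
    local_seq: int  # position in the source's own event stream
    kind: str       # "send" or "receive"
    comm_id: int    # unique ID of the communication


def logical_order(events):
    """Order events using only local order and send/receive pairing."""
    sorter = TopologicalSorter()
    for e in events:
        sorter.add(e)

    # Events of one source keep their local order.
    last_of_source = {}
    for e in sorted(events, key=lambda ev: (ev.source, ev.local_seq)):
        previous = last_of_source.get(e.source)
        if previous is not None:
            sorter.add(e, previous)
        last_of_source[e.source] = e

    # A receive can only follow the send with the same communication ID.
    sends = {e.comm_id: e for e in events if e.kind == "send"}
    for e in events:
        if e.kind == "receive" and e.comm_id in sends:
            sorter.add(e, sends[e.comm_id])

    return list(sorter.static_order())


# A calls B (communication 1), B answers (communication 2).
call_sent = Event("A", 0, "send", 1)
call_received = Event("B", 0, "receive", 1)
reply_sent = Event("B", 1, "send", 2)
reply_received = Event("A", 1, "receive", 2)

# Events arrive at the observer in arbitrary order.
ordered = logical_order(
    [reply_received, call_received, reply_sent, call_sent]
)
```

No timestamp is consulted here: the pairing by communication ID supplies the cross-device ordering that unsynchronized local clocks cannot give.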

[0064] The output interface with the OUT signals is based on a data model of the system messages created by the inspection component.

[0065] The model itself can be broken down into several logical parts, which specify different aspects of the SUB, or define different views of the system:

[0066] the abstract design,

[0067] the application structure at the time of execution, and

[0068] the physical structure.

[0069] The abstract design of a distributed application is laid down at a very early stage of development, and is static. It is not traced by the instrumentation, but instead is deduced from the type information. These items of data are regarded as part of this model because tools which are based on them can use them for identification by means of names which can be read by a person, for structuring and for filtering the system elements.

[0070] FIG. 4 shows the structure of the model, in which the relationships between the distributed application, the component, the interface, the class, the method and the global function are defined using aggregation, inheritance and implementation relationships. For example, components and classes can be aggregated or nested, as appropriate, or can have an inheritance relationship to one another.

[0071] The application structure is shown in FIG. 5 and, unlike the abstract design, specifies the concrete instances of a distributed application which have been created or deleted during a particular program run.

[0072] The physical structure is shown in FIG. 6, and specifies the run time environment of the distributed system, and breaks down as follows: distributed system, device, process and task/thread.

[0073] The model defines the following types of system events, which are generated by the inspection component:

[0074] Communication events: send/receive

[0075] Registration events: create/destroy

[0076] Local events: pass

[0077] Periodic events: periodic update

[0078] All events contain a timestamp, that is the local device time, their type, and a reference to the system elements involved. In the case of a communication event, there are always two of these; for a registration event, local event or periodic event there is one in each case.

[0079] Communication events are used to describe the following operations:

[0080] “send”: can be an outgoing method call (send call), an outgoing data transfer (send message), or an outgoing return value (send return)

[0081] “receive”: can be an incoming method call (receive call), an incoming data transfer (receive message), or an incoming return value (receive return)

[0082] Examples of registration events are:

[0083] “new”: documents the creation/setting up of a new component or an object by another system element. When they are being created, there is generally also a communication (data exchange) between the components concerned, similar to the communication event “send”.

[0084] “delete”: documents the deletion of a component or of an object by another system element.

[0085] “create”: is a special form of the event new. It documents the creation/setting up of a new component or an object, but without reproducing the relationship to the creating element.

[0086] “destroy”: documents the deletion of a component or an object by itself.

[0087] Local events (pass) are events which are generated when a particular place in the code is reached. They can document the processing of particular sections of code or can also contain internal data, e.g. the content of local variables.

[0088] Periodic events (periodic update) are useful in that they can be used to monitor system variables, or other quantities defined by the user, at periodic intervals.
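The event types of the model and the rule on the number of referenced system elements can be expressed as a small data structure. The following Python sketch is one possible rendering: the names EventKind and SystemEvent are hypothetical, only the four event categories named in the model are covered, and representing the referenced elements as a tuple of system-wide IDs is an assumption.

```python
from dataclasses import dataclass
from enum import Enum


class EventKind(Enum):
    SEND = "send"                        # communication event
    RECEIVE = "receive"                  # communication event
    CREATE = "create"                    # registration event
    DESTROY = "destroy"                  # registration event
    PASS = "pass"                        # local event
    PERIODIC_UPDATE = "periodic update"  # periodic event


COMMUNICATION_KINDS = {EventKind.SEND, EventKind.RECEIVE}


@dataclass(frozen=True)
class SystemEvent:
    timestamp: float   # local device time, not synchronized system-wide
    kind: EventKind
    elements: tuple    # system-wide IDs of the elements involved

    def __post_init__(self):
        # Communication events reference two elements, all others one.
        expected = 2 if self.kind in COMMUNICATION_KINDS else 1
        if len(self.elements) != expected:
            raise ValueError(
                f"{self.kind.value} event needs {expected} element(s), "
                f"got {len(self.elements)}"
            )


send_event = SystemEvent(12.5, EventKind.SEND, (1, 2))
pass_event = SystemEvent(12.7, EventKind.PASS, (2,))
```

Constructing a local event with two referenced elements raises a ValueError in this sketch, which makes malformed messages visible at the point where they enter the model.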

[0089] The method in accordance with the invention and the device in accordance with the invention can basically be used in all middleware-based distributed systems, such as COM, CORBA or EJB, which satisfy the requirements set out above in relation to the system SUB which is to be observed.

[0090] An implementation exists for various Windows™ systems, including Windows CE™ and COM™. The method for instrumentation, for example, is based on methods currently used under Windows™/COM™, such as:

[0091] 1. Delegation of method calls,

[0092] 2. Loading of dynamically loadable libraries (DLLs) into external processes or processes which are already being executed,

[0093] 3. Redirection of Win32 API functions.

1. Method for the automatic capture and recording of system data and activities in distributed and parallel-running component-based software systems, with which, taking into account items of configuration data (CD1, . . . , CD3) and type information (TI), at least one inspection instrument (I1, I2, I3) is inserted into at least one component (SC1, SC2, MID) of a system (SUB) which is to be observed, with which messages (N1, N2, IN) about the system to be observed are generated in the components with inspection instruments at the time of execution, with which the messages are collected, and filtered in a way dependent on the configuration data, and with which the collected messages are transformed into data (OUT) for a universal evaluation model in such a way as to reproduce the logical and communication interrelationships between the individual system components, to order system events within the system under observation into their correct time sequence, and to append static type information to the system data.

2. Method in accordance with claim 1, with which, when the system is started up, the inspection instrument, in the form of a create instrument (I3), is created only in an item of middleware (MID), where the middleware represents an item of software between an application and an underlying network layer, and at least one inspection instrument in the form of a call instrument (I1, I2) is inserted into at least one system component.

3. Method in accordance with claim 1 or 2, with which the inspection instruments supply to an inspection component (IC) messages (N1, N2) which contain instrumentation code, which is addressed by several processes, and which supplies the messages (IN) for the system under observation (SUB).

4. Device for the automatic capture and recording of system data and activities in distributed and parallel-running component-based software systems, having a configuration module (CONF) such that it makes a selection of system components, depending on an item of type information (TI), and inspection instruments (I1, I2) are inserted into these components, having an observation module (OBS) such that messages (IN) from the inspection instrument, selected in the configuration module, can be collected, having a transformation module (TRANS) such that it is possible to compile, from the messages from the inspection instruments, the data (OUT) for at least one universal evaluation module, by the reproduction of logical and communication interrelationships for the messages, by ordering system events into their correct time sequence, and by appending static type information.

5. Device in accordance with claim 4, having an instrumentation module (INST) such that at least one inspection instrument, in the form of a create instrument (I3), is created only in an item of middleware, where the middleware represents an item of software between an application and a network layer, and at least one inspection instrument in the form of a call instrument (I1, I2) is inserted into at least one system component.